Code
# Load required libraries and data preparation scripts
source("R_files/01_libraries.R")
source("R_files/02_data.R")This project analyses time series patterns of bovine Tuberculosis (bTB) cases in Ireland. (more details)
The main objectives are:
1. To describe overall trends and seasonal patterns of bTB incidence.
2. To compare different time series forecasting models.
3. To explore subgroup dynamics by herd type and test type.
4. To provide insights that may support monitoring and policy decisions.
Monthly counts of bovine Tuberculosis (bTB) cases in Ireland were compiled for the period 2008–2024 (add year auto). A case was defined as any animal testing positive under one of the three diagnostic procedures routinely implemented during the study period. Data were aggregated to obtain national-level monthly totals. Records with missing or “unknown” herd or test classifications were excluded from subgroup analyses but retained in the total-case series.
1- Data preparation Cases were aggregated by month across all test types. In subgroup analyses, incomplete herd/test categories were excluded.
2- Exploratory analysis Long-term trends and seasonal variation were visualized. Seasonal-trend decomposition using locally estimated scatterplot smoothing (STL) was applied to separate the trend, seasonal, and irregular components.
3- Time-Series Modeling The dataset was split into a training set (2008–2023) and a test set (2024) to evaluate out-of-sample predictive performance. Candidate models included autoregressive integrated moving average (ARIMA) and exponential smoothing state space models (ETS), each fitted with and without log transformation. Model performance was evaluated using Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). Residual independence was assessed via Ljung–Box tests. Time-series cross-validation was employed to compare forecasting accuracy across candidate models.
4- Forecasting Using models fitted on the 2008–2023 training period, forecasts were generated 12 months ahead for 2024 at the national total-case level (add auto). Subgroup forecasts stratified by herd type (Dairy, Beef Store + Fattener, Mixed) were also produced. Hierarchical reconciliation methods (bottom-up, ordinary least squares [OLS], and minimum trace [MinT]) were applied to ensure coherence between subgroup and aggregate forecasts.
5- Software : All analyses were performed in R version 4.4.2 (R Core Team, 2024). Time-series methods were implemented using the forecast and fpp3 packages.
# Load required libraries and data preparation scripts
source("R_files/01_libraries.R")
source("R_files/02_data.R")This figure shows the overall monthly trend of bTB cases. A declining pattern is observed until 2015, followed by an increase from 2018 onward.
# Calculate average number of cases per month
avg_per_month <- cases_per_month %>%
group_by(month) %>%
summarise(avg_cases = mean(num_cases), .groups = "drop")
# Plot overall trend of bTB cases over time
ggplot(cases_per_month, aes(x = year_month, y = num_cases)) +
geom_line() +
geom_point() +
labs(title = "Trend bTB Cases", x = "Month", y = "Number of cases") +
theme_minimal()This visualization highlights seasonal variation by month across years.
cases_per_month <- cases_per_month %>%
mutate(month = factor(month, levels = 1:12, labels = month.abb))
avg_per_month <- avg_per_month %>%
mutate(month = factor(month, levels = 1:12, labels = month.abb))
# Plot seasonality of bTB cases by month
ggplot(cases_per_month, aes(x = year, y = num_cases, group = 1)) +
geom_line() +
geom_point(size = 1) +
geom_hline(
data = avg_per_month,
aes(yintercept = avg_cases),
linetype = "solid",
color = "blue"
) +
facet_grid(. ~ month, scales = "free_x", space = "free") +
labs(
title = glue::glue(
"Seasonality of bTB by month (2008-{max(cases_per_month$year)})"
),
#add years auto
x = "Year",
y = "No. of bTB cases"
) +
theme_minimal() +
theme(
strip.background = element_rect(fill = "grey90", color = NA),
strip.text = element_text(size = 12, face = "bold"),
axis.text.x = element_text(
angle = 90,
hjust = 1,
face = "bold"
)
)# Convert data to tsibble format
cases_tsibble <- cases_per_month %>%
mutate(year_month = yearmonth(year_month)) %>%
as_tsibble(index = year_month)
# Apply STL decomposition
decomp <- cases_tsibble %>%
model(STL(num_cases ~ season(window = "periodic"))) %>%
components()
# Plot decomposition
autoplot(decomp) +
labs(
title = glue::glue("STL Decomposition of bTB Cases (2008–{max(cases_per_month$year)})"),
x = "Year",
y = "Number of cases"
) +
theme_minimal()This section presents the time series modeling and forecasting for bTB cases in Ireland. All models were fitted previously in an external R script and saved for reproducibility (add link)
The figure below compares the 12-month forecast against observed 2024 values.
##Forecast vs Observed Cases
# Load pre-saved forecast plot for 2024
knitr::include_graphics("figures/forecast_plot.png")The table summarises model performance. The log-transformed ARIMA achieved the best fit, indicated by lower AIC and error values, and passed the Ljung–Box test.
# Load fitted models and cross-validation accuracy
fit_models <- readRDS("results/fit_models.rds")
cv_accuracy_summary <- read_csv("results/cv_accuracy_summary.csv")
# Extract model metrics (AIC, BIC, sigma^2)
metrics <- fit_models %>%
glance() %>%
select(.model, AIC, BIC, sigma2)
# Combine metrics with cross-validation accuracy
final_summary <- left_join(metrics, cv_accuracy_summary, by = ".model")
# Perform Ljung-Box test to check residual independence
res_ets <- fit_models %>% select(ets) %>% augment()
res_ets_log <- fit_models %>% select(ets_log) %>% augment()
res_arima <- fit_models %>% select(arima) %>% augment()
res_arima_log <- fit_models %>% select(arima_log) %>% augment()
p_values <- tibble(
Model = c("ets", "ets_log", "arima", "arima_log"),
`p-value Ljung-Box` = c(
Box.test(res_ets$.resid, lag = 20, type = "Ljung-Box")$p.value,
Box.test(res_ets_log$.resid, lag = 20, type = "Ljung-Box")$p.value,
Box.test(res_arima$.resid, lag = 20, type = "Ljung-Box")$p.value,
Box.test(res_arima_log$.resid, lag = 20, type = "Ljung-Box")$p.value
)
)
# Merge model metrics with Ljung-Box p-values
final_table <- head(left_join(final_summary, p_values, by = c(".model" = "Model")),-2)
final_table %>%
kable(caption = "Model Comparison Metrics", digits = 2,
col.names = c("Model", "AIC", "BIC","sigma2","RMSE", "MAPE", "MAE", "p-value Ljung-Box")) %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))| Model | AIC | BIC | sigma2 | RMSE | MAPE | MAE | p-value Ljung-Box |
|---|---|---|---|---|---|---|---|
| arima | 2553.07 | 2565.82 | 86912.82 | 313.34 | 14.24 | 240.82 | 0.00 |
| arima_log | -161.83 | -149.08 | 0.02 | 289.09 | 12.74 | 219.06 | 0.05 |
| ets | 3171.66 | 3220.52 | 0.02 | 304.54 | 13.89 | 237.92 | 0.03 |
| ets_log | 295.94 | 344.81 | 0.02 | 306.08 | 13.52 | 232.80 | 0.03 |
library(plotly)
train <- cases_tsibble %>%
filter(year_month <= yearmonth("2023 Dec"))
#interactive graph
## Forecast extendido para proyección (2025–2026) ---
fc_long <- fit_models %>%
forecast(h = 36) # 36 meses desde enero 2024 = 2024–2026
## Plot largo (2024–2026, solo proyección más datos reales) ---
forecast_plot_long <- autoplot(fc_long, train) +
autolayer(cases_tsibble, colour = "black") +
labs(title = "Forecast 2024–2026 vs Actual", y = "Cases")
ggsave("figures/forecast_plot_long.png", forecast_plot_long,
width = 10, height = 6)
## --- Forecast data ---
fc_long_clean <- fc_long %>%
as_tibble() %>%
rename(
Model = .model,
Cases = .mean,
`Year-Month` = year_month
)
## --- Observed data ---
obs_clean <- cases_tsibble %>%
as_tibble() %>%
rename(
`Year-Month` = year_month,
Cases = num_cases
)
## --- Plot ---
forecast_plot_long <- ggplot(fc_long_clean, aes(x = `Year-Month`, y = Cases, color = Model)) +
geom_line(size = 0.5) + # grosor más fino para forecast
geom_line(data = obs_clean, aes(x = `Year-Month`, y = Cases), color = "black", size = 0.5) + # Observed en negro
scale_y_continuous(limits = c(0, 10000)) +
labs(
title = "Forecast of Cases 2024–2026",
y = "Number of Cases",
x = "Year-Month"
)
## --- Interactive ---
forecast_plot_long_interactive <- ggplotly(
forecast_plot_long,
tooltip = c("Model", "Cases", "Year-Month")
)
forecast_plot_long_interactive#2.4 Detailed reports for best models
# Detailed report for the best model (Arima with log transformation)
model_info <- fit_models %>% select(arima_log) %>% report()Series: num_cases
Model: ARIMA(1,1,1)(0,1,1)[12]
Transformation: log(num_cases)
Coefficients:
ar1 ma1 sma1
-0.1065 -0.6149 -0.6371
s.e. 0.1267 0.1114 0.0667
sigma^2 estimated as 0.02219: log likelihood=84.91
AIC=-161.83 AICc=-161.6 BIC=-149.08
The best-performing model for the monthly case series was a <ARIMA(1,1,1)(0,1,1)[12]> model.
# Filter out rows with missing or unknown herd types and test types
cases_plot <- all_cases_collapsed %>%
filter(!is.na(herd_type_ml_description),
herd_type_ml_description != "Unknown",
!is.na(best_estimate_gif_skin_lab_string)) %>%
filter(year != 2025) %>%
group_by(year_month, herd_type_ml_description, best_estimate_gif_skin_lab_string) %>%
summarise(num_cases = n(), .groups = "drop") %>%
mutate(
year_month = as.Date(paste0(year_month, "-01")),
series_id = paste(herd_type_ml_description, best_estimate_gif_skin_lab_string, sep = "_")
) %>%
as.data.frame() # make sure it's a regular data.frame for ggplot# Plot bTB cases by herd type and test type
ggplot(cases_plot, aes(x = year_month, y = num_cases, color = series_id)) +
geom_line(size = 1) +
geom_point(size = 1) +
facet_grid(herd_type_ml_description ~ best_estimate_gif_skin_lab_string, scales = "free_y") +
labs(
title = "bTB Cases by Herd Type and Test Type",
x = "Month",
y = "Number of cases"
) +
theme_minimal() +
theme(
strip.text = element_text(face = "bold"),
axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = "none" # remove legend
)# Prepare herd-level data for hierarchical modeling
# - Filter only relevant herd types
# - Combine "Beef Store" and "Fattener" into a single category
# - Fill gaps with zeros for months without cases
cases_herd <- all_cases_collapsed %>%
filter(!is.na(herd_type_ml_description),
herd_type_ml_description != "Unknown",
herd_type_ml_description %in% c("Dairy", "Beef Store", "Fattener", "Mixed"),
!is.na(best_estimate_gif_skin_lab_string)) %>%
mutate(
herd_type_ml_description = case_when(
herd_type_ml_description %in% c("Beef Store", "Fattener") ~ "Beef Store + Fattener",
TRUE ~ herd_type_ml_description
),
year_month = yearmonth(paste0(year_month, "-01"))
) %>%
group_by(year_month, herd_type_ml_description) %>%
summarise(num_cases = n(), .groups = "drop") %>%
as_tsibble(index = year_month, key = herd_type_ml_description) %>%
fill_gaps(num_cases = 0)
# Plot overall herd-level trends
cases_plot <- cases_herd %>%
as_tibble() %>%
mutate(series_id = herd_type_ml_description)
ggplot(cases_plot, aes(x = year_month, y = num_cases, color = series_id)) +
geom_line(size = 1) +
geom_point(size = 1) +
labs(title = "bTB Cases by Herd Type", x = "Month", y = "Number of cases") +
theme_minimal() +
theme(strip.text = element_text(face = "bold"),
axis.text.x = element_text(angle = 90, hjust = 1))Hierarchical forecasting ensures that subgroup forecasts (by herd type) add up consistently to the total forecast.
# Split training data up to Dec 2023
train_2023 <- cases_herd %>% filter(year_month <= yearmonth("2023 Dec"))
# Create hierarchical structure for forecasting
cases_hierarchy_train <- train_2023 %>%
aggregate_key(herd_type_ml_description, num_cases = sum(num_cases))
# Fit ARIMA models and reconcile forecasts
fit_reconciled_train <- cases_hierarchy_train %>%
model(
arima_log = ARIMA(log(num_cases))
) %>%
reconcile(
bu = bottom_up(arima_log),
ols = min_trace(arima_log, method = "ols"),
mint = min_trace(arima_log, method = "mint_shrink")
)
# Forecast 12 months for 2024
fc_2024 <- fit_reconciled_train %>% forecast(h = "12 months")
# Prepare actual data and aggregate total series for comparison
actual_2024 <- cases_herd %>%
filter(year_month >= yearmonth("2024 Jan") & year_month <= yearmonth("2024 Dec"))
total_2024 <- actual_2024 %>%
as_tibble() %>%
group_by(year_month) %>%
summarise(num_cases = sum(num_cases), .groups = "drop") %>%
mutate(herd_type_ml_description = "Total") %>%
as_tsibble(index = year_month, key = herd_type_ml_description)
actual_2024_full <- bind_rows(actual_2024, total_2024) %>%
as_tsibble(index = year_month, key = herd_type_ml_description)
#rename aggregated as "Total"
cases_hierarchy <- cases_hierarchy_train %>%
as_tibble() %>%
mutate(
herd_type_ml_description = ifelse(
is_aggregated(herd_type_ml_description),
"Total",
as.character(herd_type_ml_description)
)
) %>%
as_tsibble(index = year_month, key = herd_type_ml_description)
fc_2024 <- fc_2024 %>%
mutate(
herd_type_ml_description = ifelse(
is_aggregated(herd_type_ml_description),
"Total",
as.character(herd_type_ml_description)
)
)# Set factor levels so "Total" appears first
fc_2024 <- fc_2024 %>%
mutate(
herd_type_ml_description = factor(
herd_type_ml_description,
levels = c("Total", sort(setdiff(unique(herd_type_ml_description), "Total")))
)
)
cases_hierarchy <- cases_hierarchy %>%
mutate(
herd_type_ml_description = factor(
herd_type_ml_description,
levels = c("Total", sort(setdiff(unique(herd_type_ml_description), "Total")))
)
)
# Plot forecasts for each herd type
autoplot(fc_2024, cases_hierarchy, level = NULL) +
labs(
title = "Forecast by Herd Type (incl. Total)",
y = "Number of cases"
) +
theme_minimal() +
facet_wrap(~herd_type_ml_description, scales = "fixed") +
theme(
strip.text = element_text(face = "bold", size = 12),
axis.text.x = element_text(angle = 90, hjust = 1),
plot.margin = margin(t = 20, r = 20, b = 20, l = 20)
)The table below reports RMSE and MAE by herd type for 2024 forecasts. Accuracy is highest for Dairy herds and lower for mixed herds.
# Calculate accuracy metrics (RMSE and MAE) for 2024 forecasts
accuracy_2024 <- fc_2024 %>%
accuracy(data = actual_2024_full) %>%
mutate(
Level = herd_type_ml_description
) %>%
select(.model, Level, RMSE, MAE)
# Arrange table and display with styling
accuracy_2024 %>%
mutate(Level = if_else(Level == "Total", "Total", Level)) %>%
select(.model, Level, RMSE, MAE) %>%
arrange(factor(Level, levels = c("Total", "Dairy", "Beef Store+Fattener", "Mixed")), RMSE) %>%
kable(
caption = "Forecast Accuracy Metrics by Herd Type",
digits = 2,
col.names = c("Model", "Herd Type", "RMSE", "MAE")
) %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))| Model | Herd Type | RMSE | MAE |
|---|---|---|---|
| ols | Total | 351.77 | 250.07 |
| arima_log | Total | 352.32 | 248.21 |
| mint | Total | 352.48 | 251.64 |
| bu | Total | 360.90 | 266.18 |
| mint | Dairy | 322.65 | 258.05 |
| ols | Dairy | 324.81 | 260.75 |
| arima_log | Dairy | 333.70 | 270.45 |
| bu | Dairy | 333.70 | 270.45 |
| ols | Mixed | 131.73 | 87.91 |
| mint | Mixed | 131.79 | 87.99 |
| arima_log | Mixed | 133.08 | 88.19 |
| bu | Mixed | 133.08 | 88.19 |
| mint | Beef Store + Fattener | 144.97 | 125.77 |
| arima_log | Beef Store + Fattener | 145.37 | 126.26 |
| bu | Beef Store + Fattener | 145.37 | 126.26 |
| ols | Beef Store + Fattener | 158.65 | 140.24 |
#4. Summary and Conclusions